datetime  "2023-09-29 23:44:23,845"
project   "wgs315"
filename  "Diag-wgs315-HG81729774-PM.HaplotypeCaller.bam"
bytes     "12588570"
seconds   "0.3"
speed     "40990000"

datetime  "2023-09-19 03:19:08,046"
project   "Test5IlluminaDNA"
filename  "tHG26910130IDNA300B-Custom-KIT-wgs_S21_R1_001.fastq.gz"
bytes     "34683405843"
seconds   "404.4"
speed     "81800000"

datetime  "2023-09-29 17:56:47,685"
project   "wgs315"
filename  "Diag-wgs315-HG11546853-PM-DR.final.vcf"
bytes     "1197088791"
seconds   "17"
speed     "67270000"

datetime  "2023-10-24 08:52:04,164"
project   "wgs323"
filename  "Diag-wgs323-HG16194079C1242.sample"
bytes     "1562"
seconds   "0"
speed     "49540"

datetime  "2023-10-06 12:58:35,024"
project   "wgs317"
filename  "231004_A00943_0753_AHJ7HFDSX7.HG83540655-BevegForst-KIT-wgs_S6_R2_001.qc.pdf"
bytes     "117876"
seconds   "0.1"
speed     "1772160"

1 Background
GDx at OUSAMG plans to scale up WGS production to 4 x 48 samples or 2 x 48 + 1 x 96 samples per week (192 samples per week in either configuration).
This document evaluates possible bottlenecks in the IT and bioinformatics pipelines in the following areas:
- Data transfer speed
- Data storage
- Pipeline capacity (Illumina DRAGEN)
2 IT & Bioinformatics
2.1 Data transfer speed
2.1.1 Data Collection
To evaluate the data transfer speed, we collected the transfer time for each file transferred from NSC to TSD between 2023-09-01 08:41:40 and 2023-11-28 13:26:06.
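As a minimal sketch of how a per-file transfer rate can be derived from such a record, the snippet below divides the file size by the logged duration. The `Transfer` class and `throughput` function are illustrative names, not part of the original analysis. Because the logged durations appear to be rounded to 0.1 s, very small files can show a duration of 0, so the sketch returns `None` in that case rather than dividing by zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transfer:
    """One transfer log record (hypothetical structure for illustration)."""
    filename: str
    size_bytes: int
    seconds: float


def throughput(t: Transfer) -> Optional[float]:
    """Return bytes per second, or None when the logged duration is zero."""
    if t.seconds <= 0:
        return None
    return t.size_bytes / t.seconds


# Values taken from one of the logged records: 1197088791 bytes in 17 s.
example = Transfer("Diag-wgs315-HG11546853-PM-DR.final.vcf", 1197088791, 17.0)
rate = throughput(example)
print(f"{rate / 2**20:.1f} MiB/s")

# A file whose duration was logged as 0 cannot be rated this way.
tiny = Transfer("Diag-wgs323-HG16194079C1242.sample", 1562, 0.0)
print(throughput(tiny))
```

A rate derived this way will not necessarily match the `speed` field in the log exactly, since the logged duration is rounded.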
2.1.2 Overview
           filesize     speed (/s)    seconds
Min.       0.0 B        1.0 B         0.00
1st Qu.    428.0 B      12.0 KiB      0.00
Median     9.3 KiB      288.1 KiB     0.00
Mean       1.5 GiB      12.2 MiB      19.56
3rd Qu.    968.0 KiB    8.4 MiB       0.10
Max.       100.9 GiB    93.1 MiB      2084.40
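The mean file size (1.5 GiB) is far above the median (9.3 KiB), which suggests that a small number of very large files dominate the transferred volume. The sketch below shows one way such a six-number summary with binary (IEC) units could be produced. It is not the code used for this report; `format_iec` and `summarize` are illustrative names, and the input is only the five example file sizes from the log records, not the full data set.

```python
import statistics


def format_iec(n: float) -> str:
    """Format a byte count with binary prefixes (1 KiB = 1024 B)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if abs(n) < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


def summarize(values):
    """Min, quartiles, mean and max; 'inclusive' interpolates between data points."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values),
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "3rd Qu.": q3,
        "Max.": max(values),
    }


# File sizes in bytes from the five example records only.
sizes = [12588570, 34683405843, 1197088791, 1562, 117876]
summary = summarize(sizes)
for name, value in summary.items():
    print(f"{name:<9}{format_iec(value)}")
```

Even in this tiny sample the mean is pulled well above the median by the single large FASTQ file, the same pattern as in the full summary.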

2.1.3 Plots
2.1.3.1 Transfer speed and time vs. file size (all files)
2.1.3.2 Transfer speed and time vs. file size (small files)
2.1.3.3 Maximum transfer speed reached around 200 MB file size?
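One way to check where the transfer speed levels off is to group files by order of magnitude of their size and compare the median speed per group. The sketch below is an illustration under assumptions: `median_speed_by_decade` is a hypothetical helper, the `speed` field is assumed to be in bytes per second, and the input is only the five example log records, which is far too few to draw a conclusion.

```python
import math
from collections import defaultdict
from statistics import median


def median_speed_by_decade(records):
    """Map floor(log10(size in bytes)) to the median speed within that size bin."""
    bins = defaultdict(list)
    for size_bytes, speed in records:
        if size_bytes > 0:  # log10 is undefined for empty files
            bins[int(math.log10(size_bytes))].append(speed)
    return {decade: median(speeds) for decade, speeds in sorted(bins.items())}


# (bytes, speed) pairs from the five example records; speed unit assumed B/s.
records = [
    (12588570, 40990000),
    (34683405843, 81800000),
    (1197088791, 67270000),
    (1562, 49540),
    (117876, 1772160),
]
by_decade = median_speed_by_decade(records)
for decade, speed in by_decade.items():
    print(f"10^{decade} B: median speed {speed / 2**20:.2f} MiB/s")
```

Applied to the full transfer log, a plateau in the per-bin median would indicate the file size above which per-file overhead no longer limits throughput.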
2.2 Data storage
WGS produces large amounts of data, so storage capacity is critical for the upscaling.
2.2.1 NSC
On the NSC side, the data is stored on Boston at /boston/diag. Boston has a total capacity of 1.5 PB, and the usable capacity is 1.2 PB at the moment.
2.2.2 TSD
On the TSD side, the data is stored in /cluster/projects/p22. The total capacity is 1.8 PB, and the usable capacity is 1.2 PB at the moment.
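To relate these capacities to the planned production volume, a rough runway estimate can be sketched as below. This document does not state the storage footprint per sample, so `PLACEHOLDER_TB_PER_SAMPLE` is a purely hypothetical value that must be replaced with a measured figure before the result means anything. The sketch also assumes that the quoted usable capacity is space available for new data, that nothing is deleted or archived, and that 1 PB = 1000 TB.

```python
def weeks_of_runway(usable_pb: float, samples_per_week: int, tb_per_sample: float) -> float:
    """Weeks until the usable capacity is filled at a constant weekly volume."""
    usable_tb = usable_pb * 1000  # decimal units assumed: 1 PB = 1000 TB
    weekly_tb = samples_per_week * tb_per_sample
    if weekly_tb <= 0:
        raise ValueError("weekly data volume must be positive")
    return usable_tb / weekly_tb


SAMPLES_PER_WEEK = 4 * 48        # planned weekly production
PLACEHOLDER_TB_PER_SAMPLE = 0.1  # hypothetical placeholder, NOT a measured value

# 1.2 PB is the usable capacity quoted for TSD.
weeks = weeks_of_runway(1.2, SAMPLES_PER_WEEK, PLACEHOLDER_TB_PER_SAMPLE)
print(f"TSD: about {weeks:.1f} weeks at the placeholder footprint")
```

The estimate scales inversely with the per-sample footprint, so halving the stored data per sample doubles the runway.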
2.3 Pipeline capacity (Illumina DRAGEN)
Illumina DRAGEN is a bioinformatics pipeline server that can be used to process WGS data. It takes around 1 hour to process a 30x WGS sample.
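A back-of-the-envelope check of this figure against the planned volume can be sketched as follows. It assumes that samples are processed one at a time at roughly 1 hour each and that the server is available around the clock. The `servers` parameter is a hypothetical knob for illustration; this document does not state how many DRAGEN servers are available.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_load(samples_per_week: int, hours_per_sample: float = 1.0, servers: int = 1):
    """Return (busy hours per server per week, fraction of the week in use)."""
    busy_hours = samples_per_week * hours_per_sample / servers
    return busy_hours, busy_hours / HOURS_PER_WEEK


plan_a = 4 * 48           # four runs of 48 samples
plan_b = 2 * 48 + 1 * 96  # two runs of 48 plus one run of 96

busy, utilization = weekly_load(plan_a)
print(f"{plan_a} samples/week: {busy:.0f} h busy, {utilization:.0%} of a week")
```

Under these assumptions both plans amount to 192 samples per week, which is more processing hours than the 168 hours in a week for a single server running sequentially. Processing time per sample and the number of servers are therefore the parameters to verify first.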